Build a basic campaign prioritization model using all relevant variables extracted from the database and identified in previous work.
library(tidyverse)
library(reshape2)
library(gridExtra)
library(splines)
library(lubridate)
library(wranglR)
library(Boruta)
library(foreach)
library(doParallel)
library(glmnet)
library(glmnetUtils)
# Functions adapted from previous analysis steps
source('code/functions.R')
# Visualization functions adapted fron previous analysis steps
source('code/functions_viz.R')
# Set number of available CPU cores
registerDoParallel(detectCores() - 1)
The overarching goal is to predict giving over the final two years of the campaign. Ideally, I’d want to find expected future value, not just difference from expected value today. Consider the following:
\[ E \left( \text{giving, donor | covariates} \right) = E \left(\text{giving | donor, covariates} \right) P \left(\text{donor | covariates} \right) \]
Estimate the expected future value as the product of an expected value and a probability. This can also be thought of as separate capacity and affinity models, and should give more useful estimates than \(E\left( \text{giving | covariates} \right)\), which is left-censored by $0.
It’ll be informative seeing what features are more or less important at each stage of the two-step procedure, though I expect overall accuracy to suffer somewhat. Down the road it would be interesting to compare this to other methods, like trees and boosting.
The target variable is the sum of new gifts and commitments from 9/1/16 to 8/31/18 (FY17-18), given the state of the database on 8/31/16 (FY16).
As a general principle, point-in-time data is derived from entered date ranges where possible. Where dates are missing, it will be based upon the date added or date modified audit trail for each field, as suitable. The following data types received this treatment:
This is implemented by this SQL code.
# Parameters
train_fy <- 2016
filepath <- 'data/2018-11-30 point-in-time data.xlsx'
sheetname <- 'Select point_in_time_model'
# Import data
source('code/generate-pit-data.R')
# Run data generation function
modeling.data <- generate_pit_data(filepath, sheetname)
# Create response variables
modeling.data <- modeling.data %>% mutate(
rv.amt = NGC_TARGET_FY2 + NGC_TARGET_FY1
, rv.gave = rv.amt > 0
) %>% select(
# Drop future data
-NGC_TARGET_FY2
, -NGC_TARGET_FY1
, -CASH_TARGET_FY2
, -CASH_TARGET_FY1
, -PLEDGE_TARGET_FY2
, -PLEDGE_TARGET_FY1
, -AF_TARGET_FY2
, -AF_TARGET_FY1
, -CRU_TARGET_FY2
, -CRU_TARGET_FY1
) %>% filter(
# Drop entities whose RECORD_YR is after the training year
RECORD_YR <= train_fy
)
Logistic regression has been the workhorse of fundraising models for years. Some special considerations for this application:
Lifetime.Giving as a predictor if the response variable is Largest.Gift.I have previously found that penalized logistic regression, such as implemented in R by the glmnet package, works better than standard logistic regression, so that’s the technique that I’ll use here.
Here, the response variable is:
\[ Y_i = I \left( \text{FY18Giving}_i + \text{FY17Giving}_i > 0 \right) \]
I like computing random forest variable importance, e.g. Sauve & Tuleau-Malot (2014), to pre-screen variables. Define variable importance in a random forest as the change in MSE when permuting a given observation vector. One nice feature is that highly correlated variables should be similarly important.
# Sample rows
prop = 1/5 # Proportion of data to sample
set.seed(287092)
samp <- sample_n(modeling.data, size = nrow(modeling.data) * prop)
# Run Boruta algorithm
rf.vars <- Boruta(
y = as.numeric(samp$rv.gave)
, x = samp %>% select(-rv.amt, -rv.gave)
, seed = 5993207
)
rf.vars %>% print()
Boruta performed 99 iterations in 1.105775 hours.
101 attributes confirmed important: AF_PFY1, AF_PFY2, AF_PFY3, AF_PFY4, AF_PFY5 and 96 more;
43 attributes confirmed unimportant: ACTIVITIES_CFY, ACTIVITIES_PFY1, COMMITTEE_KSM_ACTIVE,
COMMITTEE_KSM_LDR, COMMITTEE_KSM_LDR_ACTIVE and 38 more;
7 tentative attributes left: ACTIVITIES_PFY2, ACTIVITIES_PFY3, ACTIVITIES_PFY4, HAS_BUS_EMAIL,
HOUSEHOLD_CITY and 2 more;
Save the results.
save(rf.vars, file = 'data/rf.vars.Rdata')
Plot the results.
(pmod_plot <- rf.vars %>% Borutadata() %>% Borutaplotter())
Basically, the algorithm creates dummy “shadow” variables, which are permuted versions of the explanatory variables appearing above, and random forests are fit on both the real and dummy variables. Intuitively, if replacing a variable with a randomly permuted version of itself does not reduce the random forest classifier’s accuracy, then the variable should not be included in a final model and can be discarded.
Recall that the response variable is making a new gift or commitment at any level within the next two years. From past experience, I know that most donations are outright gifts, under $1,000, and to an annual giving allocation. So the following is not too surprising:
I found these more surprising:
(recommended.vars <- TentativeRoughFix(rf.vars))
Boruta performed 99 iterations in 1.105775 hours.
Tentatives roughfixed over the last 99 iterations.
101 attributes confirmed important: AF_PFY1, AF_PFY2, AF_PFY3, AF_PFY4, AF_PFY5 and 96 more;
50 attributes confirmed unimportant: ACTIVITIES_CFY, ACTIVITIES_PFY1, ACTIVITIES_PFY2,
ACTIVITIES_PFY3, ACTIVITIES_PFY4 and 45 more;
# Check variable correlations
recommended_vars <- recommended.vars$finalDecision[
which(recommended.vars$finalDecision == 'Confirmed')] %>% names()
numeric_vars <- modeling.data %>%
select(recommended_vars) %>%
select(-ID_NUMBER, -HOUSEHOLD_ID) %>%
select_if(is.numeric)
numeric_vars %>% plot_corrs(textsize = 2)
This is the correlation matrix for all 57 numeric variables confirmed important by the algorithm.
Begin by creating the modeling data file.
# Data file with variables removed
mdat <- modeling.data %>% select(rv.gave, recommended_vars) %>%
select(
-VELOCITY3_NGC, -VELOCITY_BINS_NGC, -VELOCITY_BINS_CASH, -VELOCITY3_LIN_NGC
, -GIVING_MAX_PLEDGE_YR, -GIVING_MAX_PLEDGE_FY, -CRU_STATUS
, -NGC_PFY1, -NGC_PFY2, -NGC_PFY3, -NGC_PFY4, -NGC_PFY5
, -AF_PFY1, -AF_PFY2, -AF_PFY3, -AF_PFY4, -AF_PFY5
, -GIVING_MAX_CASH_FY, -GIVING_NGC_TOTAL, -UPGRADE3_NGC, -LOYAL_5_PCT_ANY
, -DEGREES_CONCAT, -BIRTH_DT, -FIRST_KSM_YEAR
, -ID_NUMBER, -INSTITUTIONAL_SUFFIX # Keep HHID but don't use in modeling
, -KSM_GOS, -HOUSEHOLD_COUNTRY
, -KSM_EVENTS_ATTENDED, -EVENTS_ATTENDED
) %>% mutate(
# Create spouse flag
SPOUSE_ALUM = ifelse(SPOUSE_FIRST_KSM_YEAR > 0, 'TRUE', 'FALSE') %>% factor()
) %>% mutate_if(
# Numeric variables over 1E4 get a log10 transformation
function(x) {
ifelse(is.numeric(x), max(x) >= 1E4, FALSE)
}
, log10plus1
)
# Cross-validation settings
folds = 10
reps = 5
# Withhold 10% of data as test set
xv <- KFoldXVal(mdat, k = 2, prop = .1, seed = 4960582)
holdoutdat <- mdat[xv[[1]], ]
traindat <- mdat[xv[[2]], ]
remove(xv)
I’ll use a penalized ridge regression model as implemented by glmnet. Advantages of shrinkage techniques include automatically controlling for overfitting and collinearity.
# Store timings
timestamps <- list()
# Store model errors
glm_nospline <- list()
# Seed for reproducibility
set.seed(2934223)
# Outer loop (repetitions)
for (rep in 1:reps) {
# Status report
timestamp <- paste('+ Iteration', rep, 'beginning at:', Sys.time())
print(timestamp)
timestamps <- c(timestamps, timestamp)
# Create cross-validation indices
xv <- KFoldXVal(traindat, k = folds)
# Inner loop (parallel cross-validation)
errs_out <- foreach(
fold = 1:length(xv)
, .combine = c
, .packages = c('glmnet', 'glmnetUtils', 'dplyr', 'splines')
) %dopar% {
# Fit temp model, where alpha = 0 is the ridge regression penalty
tmpmodel <- cv.glmnet(
rv.gave ~ .
# Train while withholding some data
, data = traindat[-xv[[fold]], ] %>% select(-HOUSEHOLD_ID)
, family = 'binomial'
, alpha = 0
, lambda = 2^(-8:5)
)
# Prediction threshold
theta1 <- sum(traindat$rv.gave[-xv[[fold]]] == 1) / nrow(traindat[-xv[[fold]], ])
# Confusion matrix based on the withheld data
tmpconfus <- conf_matrix_glmnet(tmpmodel, newdata = traindat[xv[[fold]], ], rv = 'rv.gave', threshold = theta1)
# Return results
return(
list(
conf_matrix = tmpconfus$conf_matrix
, conf_matrix_pct = tmpconfus$conf_matrix_pct
, errors = data.frame(
reps = rep
, folds = fold
, error = tmpconfus$error
, precision = tmpconfus$precision
, sensitivity = tmpconfus$sensitivity
, F1_score = tmpconfus$F1_score
)
)
)
}
# Write results to errors data frame
glm_nospline <- c(glm_nospline, errs_out)
# Status report
timestamp <- paste(' -Iteration', rep, 'ending at: ', Sys.time())
print(timestamp)
timestamps <- c(timestamps, timestamp)
}
glm_nospline_timestamps %>% unlist() %>% print()
[1] "+ Iteration 1 beginning at: 2018-12-10 17:17:23"
[2] " -Iteration 1 ending at: 2018-12-10 17:19:48"
[3] "+ Iteration 2 beginning at: 2018-12-10 17:19:48"
[4] " -Iteration 2 ending at: 2018-12-10 17:22:01"
[5] "+ Iteration 3 beginning at: 2018-12-10 17:22:01"
[6] " -Iteration 3 ending at: 2018-12-10 17:24:00"
[7] "+ Iteration 4 beginning at: 2018-12-10 17:24:00"
[8] " -Iteration 4 ending at: 2018-12-10 17:25:58"
[9] "+ Iteration 5 beginning at: 2018-12-10 17:25:58"
[10] " -Iteration 5 ending at: 2018-12-10 17:27:53"
# Function to reshape list data
combine_xval <- function(xval_results = list()) {
# Function to reformat list output into groups
delister <- function(full_list, first_idx = 1, seq) {
output <- list()
idx <- seq(first_idx, length(full_list), by = seq)
for (i in 1:length(idx)) {
output <- c(output, full_list[idx[i]])
}
return(output)
}
# Separate the output into groups of 3
conf_matrix = delister(xval_results, 1, 3)
conf_matrix_pct = delister(xval_results, 2, 3)
errors = delister(xval_results, 3, 3)
# Turn errors into a data frame
errors <- foreach(i = 1:length(errors), .combine = rbind) %do% {
return(errors[[i]])
} %>% data.frame()
# Return organized list
return(
list(
conf_matrix = conf_matrix
, conf_matrix_pct = conf_matrix_pct
, errors = errors
)
)
}
# Save results
glm_ridge_baseline_results <- combine_xval(glm_nospline)
glm_ridge_baseline_timestamps <- timestamps
glm_ridge_baseline_model <- cv.glmnet(
rv.gave ~ .
, data = traindat %>% select(-HOUSEHOLD_ID)
, family = 'binomial'
, alpha = 0
, lambda = 2^(-8:5)
)
save(
glm_ridge_baseline_model
, glm_ridge_baseline_results
, glm_ridge_baseline_timestamps
, file = 'data/glm_ridge_baseline.Rdata'
)
grid.arrange(
histogrammer(glm_ridge_baseline_results$errors, 'error', h = .0005, fill = 'pink')
, histogrammer(glm_ridge_baseline_results$errors, 'precision', h = .005, fill = 'cyan')
, histogrammer(glm_ridge_baseline_results$errors, 'sensitivity', h = .005, fill = 'green')
)
Let TP, TN, FP, FN refer to true positives, true negatives, false positives, and false negatives respectively.
\[ \text{error} = \frac{FP + FN}{n}\] \[ \text{precision} = \frac{TP}{TP + FP}\] \[ \text{sensitivity} = \frac{TP}{TP + FN}\]
Compared to the AF $10K model, this has higher error due to the decreased sensitivity, but much higher precision.
The metrics to beat so far:
(
glm_baseline_err <- data.frame(
glm_ridge_baseline = glm_ridge_baseline_results$errors %>%
select(-reps, -folds) %>%
colMeans()
)
)
Consider a standard logistic regression model to get a better sense of the explanatory variables.
glm_standard <- glm(
rv.gave ~ .
, data = traindat %>% select(-HOUSEHOLD_ID) %>%
select(-RECORD_STATUS_CODE) # Results in separation if included
, family = 'binomial'
)
summary(glm_standard)
Call:
glm(formula = rv.gave ~ ., family = "binomial", data = traindat %>%
select(-HOUSEHOLD_ID) %>% select(-RECORD_STATUS_CODE))
Deviance Residuals:
Min 1Q Median 3Q Max
-2.5698 -0.2622 -0.1272 -0.0243 3.6243
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.798e+01 1.999e+01 -4.900 9.57e-07 ***
PROGRAM_GROUPNONE -1.206e+00 3.627e-01 -3.324 0.000887 ***
PROGRAM_GROUPEMP -4.329e-01 1.929e-01 -2.245 0.024789 *
PROGRAM_GROUPEXECED -2.832e+00 9.097e-01 -3.113 0.001850 **
PROGRAM_GROUPNONGRD -1.257e+01 2.815e+02 -0.045 0.964391
PROGRAM_GROUPPHD -1.890e+00 1.368e+00 -1.382 0.167061
PROGRAM_GROUPTMP -1.737e-01 1.602e-01 -1.084 0.278260
SPOUSE_FIRST_KSM_YEAR 4.710e-03 2.367e-02 0.199 0.842280
PREF_ADDR_TYPE_CODEALT -1.648e-01 5.394e-01 -0.306 0.759927
PREF_ADDR_TYPE_CODEBUS -3.586e-02 2.267e-01 -0.158 0.874304
HOUSEHOLD_CONTINENT -2.809e+00 6.082e-01 -4.619 3.85e-06 ***
HOUSEHOLD_CONTINENTAfrica 9.811e-01 1.065e+00 0.921 0.357081
HOUSEHOLD_CONTINENTAsia 5.199e-02 2.452e-01 0.212 0.832105
HOUSEHOLD_CONTINENTEurope -7.461e-01 3.957e-01 -1.885 0.059381 .
HOUSEHOLD_CONTINENTOceania -2.049e+00 1.329e+00 -1.542 0.123086
HOUSEHOLD_CONTINENTSouth America -1.949e+00 1.046e+00 -1.864 0.062390 .
BUS_IS_EMPLOYEDTRUE 3.523e-01 1.882e-01 1.872 0.061231 .
HAS_HOME_ADDRTRUE -2.556e-01 1.218e-01 -2.099 0.035809 *
HAS_HOME_PHONETRUE -3.122e-01 1.275e-01 -2.449 0.014333 *
HAS_HOME_EMAILTRUE 8.856e-02 1.508e-01 0.587 0.557122
GIVING_FIRST_YEAR -1.669e-02 1.181e-02 -1.413 0.157645
GIVING_FIRST_YEAR_CASH_AMT -1.882e-01 8.411e-02 -2.237 0.025286 *
GIVING_FIRST_YEAR_PLEDGE_AMT -5.385e-02 9.753e-02 -0.552 0.580864
GIVING_MAX_CASH_AMT 7.247e-02 2.975e-01 0.244 0.807528
GIVING_MAX_PLEDGE_AMT 9.012e-01 1.526e+00 0.591 0.554811
GIVING_CASH_TOTAL 2.071e-01 4.065e-01 0.510 0.610398
GIVING_PLEDGE_TOTAL -1.195e+00 1.587e+00 -0.753 0.451433
GIVING_AF_TOTAL -6.665e-02 9.750e-02 -0.684 0.494195
GIVING_CRU_TOTAL 4.157e-01 2.734e-01 1.520 0.128386
GIFTS_ALLOCS_SUPPORTED 1.548e-01 6.779e-02 2.284 0.022366 *
GIFTS_FYS_SUPPORTED 1.588e-02 3.467e-02 0.458 0.646889
GIFTS_CASH 2.269e-02 2.581e-02 0.879 0.379359
GIFTS_CREDIT_CARD1 -1.025e-02 1.969e-01 -0.052 0.958484
GIFTS_CREDIT_CARD2+ 1.385e-01 2.129e-01 0.651 0.515173
GIFTS_OUTRIGHTS_PAYMENTS -1.398e-02 2.709e-02 -0.516 0.605902
GIFTS_PLEDGES 1.351e-01 2.217e-01 0.610 0.542139
CASH_PFY1 4.891e-01 3.632e-01 1.347 0.178071
CASH_PFY2 8.663e-02 2.403e-01 0.361 0.718426
CASH_PFY3 -2.935e-01 2.254e-01 -1.302 0.192870
CASH_PFY4 4.748e-01 1.692e-01 2.806 0.005015 **
CASH_PFY5 -2.687e-01 2.410e-01 -1.115 0.264958
CRU_PFY1 -6.172e-01 3.499e-01 -1.764 0.077698 .
CRU_PFY2 -3.765e-01 2.671e-01 -1.410 0.158682
CRU_PFY3 1.566e-01 2.306e-01 0.679 0.497256
CRU_PFY4 -2.260e-01 1.653e-01 -1.367 0.171634
CRU_PFY5 -5.680e-02 1.478e-01 -0.384 0.700809
CRU_GIVING_SEGMENTDonor 3.524e+01 2.368e+01 1.489 0.136593
CRU_GIVING_SEGMENTLapsed 3.313e+01 2.364e+01 1.401 0.161065
CRU_GIVING_SEGMENTLoyal 2 of 3 3.625e+01 2.370e+01 1.530 0.126069
CRU_GIVING_SEGMENTLoyal 3+ 3.751e+01 2.370e+01 1.583 0.113399
CRU_GIVING_SEGMENTLYBUNT 3.544e+01 2.369e+01 1.496 0.134578
CRU_GIVING_SEGMENTNon 3.391e+01 2.365e+01 1.434 0.151670
CRU_GIVING_SEGMENTPYBUNT 3.391e+01 2.367e+01 1.433 0.151937
GIFT_CLUB_KLC_YRS -2.815e-02 7.498e-02 -0.375 0.707291
GIFT_CLUB_LOYAL_YRS 1.975e-02 9.810e-02 0.201 0.840415
GIFT_CLUBS_CFY 5.778e-02 7.710e-02 0.749 0.453643
GIFT_CLUBS_PFY1 4.801e-03 1.052e-01 0.046 0.963588
GIFT_CLUBS_PFY2 4.405e-02 9.820e-02 0.449 0.653733
EVALUATION_LOWER_BOUND 8.567e-03 4.108e-02 0.209 0.834820
UOR_LOWER_BOUND -4.966e-02 5.701e-02 -0.871 0.383762
KSM_GOS_FLAGTRUE 1.514e-02 4.640e-01 0.033 0.973968
MONTHS_ASSIGNED 2.248e-02 9.112e-03 2.467 0.013612 *
COMMITTEE_NU_DISTINCT -1.917e-01 9.978e-02 -1.922 0.054644 .
COMMITTEE_NU_YEARS -2.897e-02 5.205e-02 -0.557 0.577818
COMMITTEE_KSM_DISTINCT 2.481e-01 1.040e-01 2.386 0.017044 *
COMMITTEES_CFY 1.254e-01 1.313e-01 0.955 0.339577
COMMITTEES_PFY1 -3.455e-02 9.711e-02 -0.356 0.722012
COMMITTEES_PFY2 1.075e-01 1.209e-01 0.889 0.374113
COMMITTEES_PFY3 1.161e-03 1.118e-01 0.010 0.991712
EVENTS_YRS 1.716e-02 2.741e-02 0.626 0.531396
EVENTS_PREV_3_FY -1.463e-01 9.332e-02 -1.567 0.117063
KSM_EVENTS_YRS 4.654e-02 8.970e-02 0.519 0.603836
KSM_EVENTS_PREV_3_FY -9.455e-02 5.290e-02 -1.788 0.073846 .
KSM_EVENTS_REUNIONS1 8.428e-02 1.704e-01 0.495 0.620834
KSM_EVENTS_REUNIONS2 -2.092e-01 2.670e-01 -0.784 0.433302
KSM_EVENTS_REUNIONS3+ -3.511e-01 5.883e-01 -0.597 0.550603
EVENTS_CFY 2.425e-01 9.632e-02 2.517 0.011819 *
EVENTS_PFY1 1.644e-01 9.885e-02 1.663 0.096307 .
ATHLETICS_TICKET_YEARS 6.593e-02 8.951e-02 0.737 0.461423
ATHLETICS_TICKET_LAST -4.022e-04 2.589e-04 -1.554 0.120288
RECORD_YR 2.354e-02 8.329e-03 2.826 0.004708 **
GIVING_MAX_CASH_YR 2.346e-02 9.417e-03 2.491 0.012733 *
GIVING_MAX_CASH_MO2 6.467e-01 3.771e-01 1.715 0.086386 .
GIVING_MAX_CASH_MO3 8.039e-01 3.543e-01 2.269 0.023286 *
GIVING_MAX_CASH_MO4 7.149e-01 3.477e-01 2.056 0.039777 *
GIVING_MAX_CASH_MO5 2.531e-01 3.446e-01 0.735 0.462625
GIVING_MAX_CASH_MO6 5.137e-01 3.423e-01 1.501 0.133439
GIVING_MAX_CASH_MO7 6.040e-01 3.711e-01 1.628 0.103566
GIVING_MAX_CASH_MO8 6.181e-01 3.474e-01 1.779 0.075187 .
GIVING_MAX_CASH_MO9 7.939e-01 4.284e-01 1.853 0.063841 .
GIVING_MAX_CASH_MO10 6.190e-01 3.912e-01 1.583 0.113523
GIVING_MAX_CASH_MO11 3.670e-01 3.679e-01 0.997 0.318531
GIVING_MAX_CASH_MO12 7.403e-01 3.200e-01 2.314 0.020690 *
GIVING_MAX_PLEDGE_MO2 -8.841e-02 2.928e-01 -0.302 0.762704
GIVING_MAX_PLEDGE_MO3 -2.501e-01 2.836e-01 -0.882 0.377797
GIVING_MAX_PLEDGE_MO4 -2.535e-01 3.070e-01 -0.826 0.409040
GIVING_MAX_PLEDGE_MO5 1.772e-01 2.699e-01 0.657 0.511493
GIVING_MAX_PLEDGE_MO6 -9.003e-02 2.736e-01 -0.329 0.742085
GIVING_MAX_PLEDGE_MO7 -2.314e-01 3.437e-01 -0.673 0.500823
GIVING_MAX_PLEDGE_MO8 3.592e-02 2.990e-01 0.120 0.904382
GIVING_MAX_PLEDGE_MO9 -2.618e-01 3.425e-01 -0.764 0.444638
GIVING_MAX_PLEDGE_MO10 -1.701e-01 2.691e-01 -0.632 0.527445
GIVING_MAX_PLEDGE_MO11 -1.630e-01 2.804e-01 -0.581 0.561071
GIVING_MAX_PLEDGE_MO12 -7.720e-02 3.134e-01 -0.246 0.805456
KSM_PROSPECTNo -3.179e-01 2.607e-01 -1.220 0.222567
KSM_PROSPECTPast -6.016e-02 3.465e-01 -0.174 0.862167
VISITORS_5FY 1.316e-02 8.759e-02 0.150 0.880559
LOYAL_5_PCT_CASH 4.913e+00 2.574e+00 1.909 0.056324 .
UPGRADE3_CASH-1 5.199e-01 5.744e-01 0.905 0.365413
UPGRADE3_CASH0 1.240e-01 5.811e-01 0.213 0.831096
UPGRADE3_CASH1 -3.543e-01 6.801e-01 -0.521 0.602450
UPGRADE3_CASH2 -3.322e-01 7.657e-01 -0.434 0.664372
VELOCITY3_CASH -4.990e-01 4.077e-01 -1.224 0.221049
VELOCITY3_LIN_CASH 2.039e-01 1.009e-01 2.022 0.043220 *
SPOUSE_ALUMTRUE -9.020e+00 4.731e+01 -0.191 0.848797
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 5263.0 on 8455 degrees of freedom
Residual deviance: 2575.4 on 8341 degrees of freedom
AIC: 2805.4
Number of Fisher Scoring iterations: 16
summary(glm_standard, corr = TRUE)$correlation %>%
data.frame() %>%
plot_corrs()
Pretty eyewatering. Look at the term plots.
termplot(glm_standard)
Definitely keep:
Definitely drop:
Needs transformation:
Duplicative:
glm_standard <- glm_standard %>% update(
data = traindat %>% select(
-HOUSEHOLD_ID
, -RECORD_STATUS_CODE
# Drop
, PREF_ADDR_TYPE_CODE
, -HAS_HOME_EMAIL
, -GIFTS_CREDIT_CARD
, -contains('GIFT_CLUB')
, -KSM_GOS_FLAG
, -KSM_EVENTS_REUNIONS
, -GIVING_MAX_PLEDGE_MO
# Duplicative
, -GIVING_FIRST_YEAR_PLEDGE_AMT
, -GIVING_MAX_CASH_AMT
, -GIVING_AF_TOTAL
, -GIFTS_OUTRIGHTS_PAYMENTS
, -contains('CRU_PFY')
, -contains('COMMITTEES_')
, -contains('EVENTS_YRS')
, -VELOCITY3_CASH
, -SPOUSE_FIRST_KSM_YEAR
)
)
summary(glm_standard)
Call:
glm(formula = rv.gave ~ ., family = "binomial", data = traindat %>%
select(-HOUSEHOLD_ID, -RECORD_STATUS_CODE, PREF_ADDR_TYPE_CODE,
-HAS_HOME_EMAIL, -GIFTS_CREDIT_CARD, -contains("GIFT_CLUB"),
-KSM_GOS_FLAG, -KSM_EVENTS_REUNIONS, -GIVING_MAX_PLEDGE_MO,
-GIVING_FIRST_YEAR_PLEDGE_AMT, -GIVING_MAX_CASH_AMT,
-GIVING_AF_TOTAL, -GIFTS_OUTRIGHTS_PAYMENTS, -contains("CRU_PFY"),
-contains("COMMITTEES_"), -contains("EVENTS_YRS"), -VELOCITY3_CASH,
-SPOUSE_FIRST_KSM_YEAR))
Deviance Residuals:
Min 1Q Median 3Q Max
-2.5370 -0.2692 -0.1294 -0.0245 3.7065
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.017e+02 1.923e+01 -5.287 1.24e-07 ***
PROGRAM_GROUPNONE -1.162e+00 3.373e-01 -3.447 0.000568 ***
PROGRAM_GROUPEMP -4.934e-01 1.825e-01 -2.703 0.006862 **
PROGRAM_GROUPEXECED -2.963e+00 9.079e-01 -3.263 0.001102 **
PROGRAM_GROUPNONGRD -1.266e+01 2.816e+02 -0.045 0.964143
PROGRAM_GROUPPHD -2.012e+00 1.419e+00 -1.418 0.156239
PROGRAM_GROUPTMP -2.223e-01 1.486e-01 -1.496 0.134743
PREF_ADDR_TYPE_CODEALT -2.486e-01 5.404e-01 -0.460 0.645539
PREF_ADDR_TYPE_CODEBUS -7.062e-02 2.234e-01 -0.316 0.751967
HOUSEHOLD_CONTINENT -2.890e+00 5.923e-01 -4.880 1.06e-06 ***
HOUSEHOLD_CONTINENTAfrica 8.810e-01 1.063e+00 0.829 0.407218
HOUSEHOLD_CONTINENTAsia -6.824e-05 2.399e-01 0.000 0.999773
HOUSEHOLD_CONTINENTEurope -7.322e-01 3.895e-01 -1.880 0.060149 .
HOUSEHOLD_CONTINENTOceania -2.061e+00 1.305e+00 -1.579 0.114238
HOUSEHOLD_CONTINENTSouth America -1.925e+00 1.042e+00 -1.848 0.064664 .
BUS_IS_EMPLOYEDTRUE 3.378e-01 1.837e-01 1.839 0.065924 .
HAS_HOME_ADDRTRUE -2.619e-01 1.193e-01 -2.195 0.028186 *
HAS_HOME_PHONETRUE -3.149e-01 1.251e-01 -2.518 0.011816 *
GIVING_FIRST_YEAR -1.234e-02 1.120e-02 -1.102 0.270426
GIVING_FIRST_YEAR_CASH_AMT -1.908e-01 7.208e-02 -2.647 0.008117 **
GIVING_MAX_PLEDGE_AMT 1.043e+00 1.465e+00 0.712 0.476325
GIVING_CASH_TOTAL 5.653e-01 2.276e-01 2.484 0.012993 *
GIVING_PLEDGE_TOTAL -1.367e+00 1.525e+00 -0.896 0.370042
GIVING_CRU_TOTAL -7.557e-02 2.131e-01 -0.355 0.722892
GIFTS_ALLOCS_SUPPORTED 1.276e-01 6.515e-02 1.959 0.050084 .
GIFTS_FYS_SUPPORTED 2.597e-02 2.562e-02 1.014 0.310719
GIFTS_CASH 8.815e-03 1.919e-02 0.459 0.646013
GIFTS_PLEDGES 1.928e-01 2.109e-01 0.914 0.360692
CASH_PFY1 7.492e-02 1.608e-01 0.466 0.641184
CASH_PFY2 -3.255e-02 1.077e-01 -0.302 0.762609
CASH_PFY3 -5.704e-02 1.125e-01 -0.507 0.612028
CASH_PFY4 2.587e-01 7.488e-02 3.455 0.000551 ***
CASH_PFY5 -3.546e-01 1.885e-01 -1.881 0.059941 .
CRU_GIVING_SEGMENTDonor 2.651e+01 2.248e+01 1.180 0.238143
CRU_GIVING_SEGMENTLapsed 2.481e+01 2.244e+01 1.105 0.269032
CRU_GIVING_SEGMENTLoyal 2 of 3 2.698e+01 2.249e+01 1.200 0.230278
CRU_GIVING_SEGMENTLoyal 3+ 2.790e+01 2.249e+01 1.241 0.214691
CRU_GIVING_SEGMENTLYBUNT 2.643e+01 2.249e+01 1.175 0.239895
CRU_GIVING_SEGMENTNon 2.460e+01 2.243e+01 1.097 0.272799
CRU_GIVING_SEGMENTPYBUNT 2.548e+01 2.247e+01 1.134 0.256807
EVALUATION_LOWER_BOUND 1.921e-02 4.007e-02 0.479 0.631616
UOR_LOWER_BOUND -4.549e-02 5.614e-02 -0.810 0.417820
MONTHS_ASSIGNED 2.216e-02 8.422e-03 2.631 0.008522 **
COMMITTEE_NU_DISTINCT -9.981e-02 8.664e-02 -1.152 0.249313
COMMITTEE_NU_YEARS -8.453e-03 4.998e-02 -0.169 0.865692
COMMITTEE_KSM_DISTINCT 1.818e-01 8.902e-02 2.042 0.041147 *
EVENTS_PREV_3_FY -7.890e-02 7.756e-02 -1.017 0.309019
KSM_EVENTS_PREV_3_FY -8.378e-02 4.333e-02 -1.934 0.053142 .
EVENTS_CFY 1.986e-01 9.019e-02 2.202 0.027693 *
EVENTS_PFY1 1.185e-01 9.302e-02 1.274 0.202536
ATHLETICS_TICKET_YEARS 8.211e-02 8.698e-02 0.944 0.345143
ATHLETICS_TICKET_LAST -3.888e-04 2.550e-04 -1.525 0.127343
RECORD_YR 2.495e-02 7.838e-03 3.183 0.001458 **
GIVING_MAX_CASH_YR 2.403e-02 8.989e-03 2.674 0.007498 **
GIVING_MAX_CASH_MO2 5.846e-01 3.646e-01 1.603 0.108914
GIVING_MAX_CASH_MO3 7.174e-01 3.368e-01 2.130 0.033157 *
GIVING_MAX_CASH_MO4 6.155e-01 3.343e-01 1.842 0.065543 .
GIVING_MAX_CASH_MO5 1.971e-01 3.329e-01 0.592 0.553758
GIVING_MAX_CASH_MO6 3.999e-01 3.303e-01 1.211 0.225984
GIVING_MAX_CASH_MO7 4.621e-01 3.605e-01 1.282 0.199935
GIVING_MAX_CASH_MO8 5.937e-01 3.301e-01 1.799 0.072092 .
GIVING_MAX_CASH_MO9 6.199e-01 4.046e-01 1.532 0.125510
GIVING_MAX_CASH_MO10 5.522e-01 3.749e-01 1.473 0.140754
GIVING_MAX_CASH_MO11 2.486e-01 3.556e-01 0.699 0.484461
GIVING_MAX_CASH_MO12 6.464e-01 3.088e-01 2.093 0.036334 *
KSM_PROSPECTNo -2.354e-01 2.544e-01 -0.925 0.354866
KSM_PROSPECTPast -2.091e-02 3.420e-01 -0.061 0.951234
VISITORS_5FY 4.505e-02 8.552e-02 0.527 0.598326
LOYAL_5_PCT_CASH 5.521e+00 2.307e+00 2.394 0.016682 *
UPGRADE3_CASH-1 1.597e-01 5.574e-01 0.287 0.774433
UPGRADE3_CASH0 -1.311e-01 5.664e-01 -0.231 0.816964
UPGRADE3_CASH1 -6.675e-01 6.616e-01 -1.009 0.313008
UPGRADE3_CASH2 -8.606e-01 7.197e-01 -1.196 0.231751
VELOCITY3_LIN_CASH 1.538e-01 9.372e-02 1.641 0.100738
SPOUSE_ALUMTRUE 3.724e-01 2.599e-01 1.433 0.151775
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 5263.0 on 8455 degrees of freedom
Residual deviance: 2605.3 on 8381 degrees of freedom
AIC: 2755.3
Number of Fisher Scoring iterations: 16
That’s a nice drop in the AIC.
dfs <- 4
Now introduce splines on the numeric variables, arbitrarily setting df = 4.
glm_st_splines <- glm(
rv.gave ~
PROGRAM_GROUP +
PREF_ADDR_TYPE_CODE +
HOUSEHOLD_CONTINENT +
BUS_IS_EMPLOYED +
HAS_HOME_ADDR +
HAS_HOME_PHONE +
ns(YEARS_SINCE_FIRST_GIFT, df = dfs) +
ns(GIVING_FIRST_YEAR_CASH_AMT, df = dfs) +
ns(GIVING_MAX_PLEDGE_AMT, df = dfs) +
ns(GIVING_CASH_TOTAL, df = dfs) +
ns(GIVING_PLEDGE_TOTAL, df = dfs) +
ns(GIVING_CRU_TOTAL, df = dfs) +
ns(GIFTS_ALLOCS_SUPPORTED, df = dfs) +
ns(GIFTS_FYS_SUPPORTED, df = dfs) +
ns(GIFTS_CASH, df = dfs) +
ns(GIFTS_PLEDGES, df = dfs) +
ns(CASH_PFY1, df = dfs) +
ns(CASH_PFY2, df = dfs) +
ns(CASH_PFY3, df = dfs) +
ns(CASH_PFY4, df = dfs) +
ns(CASH_PFY5, df = dfs) +
CRU_GIVING_SEGMENT +
ns(EVALUATION_LOWER_BOUND, df = dfs) +
ns(UOR_LOWER_BOUND, df = dfs) +
ns(MONTHS_ASSIGNED, df = dfs) +
ns(COMMITTEE_NU_DISTINCT, df = dfs) +
ns(COMMITTEE_NU_YEARS, df = dfs) +
ns(COMMITTEE_KSM_DISTINCT, df = dfs) +
ns(EVENTS_PREV_3_FY, df = dfs) +
ns(EVENTS_CFY, df = dfs) +
ns(EVENTS_PFY1, df = dfs) +
ns(ATHLETICS_TICKET_YEARS, df = dfs) +
ns(YEARS_SINCE_ATHLETICS_TICKETS, df = dfs) +
ns(RECORD_YR, df = dfs) +
ns(YEARS_SINCE_MAX_CASH_YR, df = dfs) +
GIVING_MAX_CASH_MO +
KSM_PROSPECT +
ns(VISITORS_5FY, df = dfs) +
LOYAL_5_PCT_CASH +
UPGRADE3_CASH +
VELOCITY3_LIN_CASH +
SPOUSE_ALUM
, data = traindat %>% mutate(
YEARS_SINCE_FIRST_GIFT = 2016 - ifelse(GIVING_FIRST_YEAR > 0, GIVING_FIRST_YEAR, 2017)
, YEARS_SINCE_ATHLETICS_TICKETS = 2016 - ifelse(ATHLETICS_TICKET_LAST > 0, ATHLETICS_TICKET_LAST, 2017)
, YEARS_SINCE_MAX_CASH_YR = 2016 - ifelse(GIVING_MAX_CASH_YR > 0, GIVING_MAX_CASH_YR, 2017)
)
, family = 'binomial'
)
summary(glm_st_splines)
Call:
glm(formula = rv.gave ~ PROGRAM_GROUP + PREF_ADDR_TYPE_CODE +
HOUSEHOLD_CONTINENT + BUS_IS_EMPLOYED + HAS_HOME_ADDR + HAS_HOME_PHONE +
ns(YEARS_SINCE_FIRST_GIFT, df = dfs) + ns(GIVING_FIRST_YEAR_CASH_AMT,
df = dfs) + ns(GIVING_MAX_PLEDGE_AMT, df = dfs) + ns(GIVING_CASH_TOTAL,
df = dfs) + ns(GIVING_PLEDGE_TOTAL, df = dfs) + ns(GIVING_CRU_TOTAL,
df = dfs) + ns(GIFTS_ALLOCS_SUPPORTED, df = dfs) + ns(GIFTS_FYS_SUPPORTED,
df = dfs) + ns(GIFTS_CASH, df = dfs) + ns(GIFTS_PLEDGES,
df = dfs) + ns(CASH_PFY1, df = dfs) + ns(CASH_PFY2, df = dfs) +
ns(CASH_PFY3, df = dfs) + ns(CASH_PFY4, df = dfs) + ns(CASH_PFY5,
df = dfs) + CRU_GIVING_SEGMENT + ns(EVALUATION_LOWER_BOUND,
df = dfs) + ns(UOR_LOWER_BOUND, df = dfs) + ns(MONTHS_ASSIGNED,
df = dfs) + ns(COMMITTEE_NU_DISTINCT, df = dfs) + ns(COMMITTEE_NU_YEARS,
df = dfs) + ns(COMMITTEE_KSM_DISTINCT, df = dfs) + ns(EVENTS_PREV_3_FY,
df = dfs) + ns(EVENTS_CFY, df = dfs) + ns(EVENTS_PFY1, df = dfs) +
ns(ATHLETICS_TICKET_YEARS, df = dfs) + ns(YEARS_SINCE_ATHLETICS_TICKETS,
df = dfs) + ns(RECORD_YR, df = dfs) + ns(YEARS_SINCE_MAX_CASH_YR,
df = dfs) + GIVING_MAX_CASH_MO + KSM_PROSPECT + ns(VISITORS_5FY,
df = dfs) + LOYAL_5_PCT_CASH + UPGRADE3_CASH + VELOCITY3_LIN_CASH +
SPOUSE_ALUM, family = "binomial", data = traindat %>% mutate(YEARS_SINCE_FIRST_GIFT = 2016 -
ifelse(GIVING_FIRST_YEAR > 0, GIVING_FIRST_YEAR, 2017), YEARS_SINCE_ATHLETICS_TICKETS = 2016 -
ifelse(ATHLETICS_TICKET_LAST > 0, ATHLETICS_TICKET_LAST,
2017), YEARS_SINCE_MAX_CASH_YR = 2016 - ifelse(GIVING_MAX_CASH_YR >
0, GIVING_MAX_CASH_YR, 2017)))
Deviance Residuals:
Min 1Q Median 3Q Max
-2.6194 -0.2643 -0.1309 -0.0165 3.6621
Coefficients: (71 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.16122 4.93638 -0.640 0.521918
PROGRAM_GROUPNONE -1.18701 0.34913 -3.400 0.000674 ***
PROGRAM_GROUPEMP -0.46667 0.18624 -2.506 0.012221 *
PROGRAM_GROUPEXECED -2.69190 0.84607 -3.182 0.001464 **
PROGRAM_GROUPNONGRD -12.70303 282.17952 -0.045 0.964093
PROGRAM_GROUPPHD -1.90850 1.36514 -1.398 0.162106
PROGRAM_GROUPTMP -0.20256 0.15013 -1.349 0.177258
PREF_ADDR_TYPE_CODEALT -0.22550 0.53301 -0.423 0.672245
PREF_ADDR_TYPE_CODEBUS -0.07024 0.22636 -0.310 0.756333
HOUSEHOLD_CONTINENT -2.81215 0.60428 -4.654 3.26e-06 ***
HOUSEHOLD_CONTINENTAfrica 0.95081 1.06315 0.894 0.371142
HOUSEHOLD_CONTINENTAsia 0.04741 0.24253 0.195 0.845025
HOUSEHOLD_CONTINENTEurope -0.75310 0.39922 -1.886 0.059240 .
HOUSEHOLD_CONTINENTOceania -2.17620 1.32330 -1.645 0.100067
HOUSEHOLD_CONTINENTSouth America -1.84224 1.04106 -1.770 0.076798 .
BUS_IS_EMPLOYEDTRUE 0.30659 0.18783 1.632 0.102625
HAS_HOME_ADDRTRUE -0.25987 0.12113 -2.145 0.031928 *
HAS_HOME_PHONETRUE -0.33062 0.12622 -2.619 0.008811 **
ns(YEARS_SINCE_FIRST_GIFT, df = dfs)1 -0.33448 1.06622 -0.314 0.753746
ns(YEARS_SINCE_FIRST_GIFT, df = dfs)2 0.36884 0.58139 0.634 0.525808
ns(YEARS_SINCE_FIRST_GIFT, df = dfs)3 NA NA NA NA
ns(YEARS_SINCE_FIRST_GIFT, df = dfs)4 NA NA NA NA
ns(GIVING_FIRST_YEAR_CASH_AMT, df = dfs)1 0.05508 1.93420 0.028 0.977281
ns(GIVING_FIRST_YEAR_CASH_AMT, df = dfs)2 -1.17237 1.49336 -0.785 0.432420
ns(GIVING_FIRST_YEAR_CASH_AMT, df = dfs)3 NA NA NA NA
ns(GIVING_FIRST_YEAR_CASH_AMT, df = dfs)4 NA NA NA NA
ns(GIVING_MAX_PLEDGE_AMT, df = dfs)1 -5.16523 11.22447 -0.460 0.645390
ns(GIVING_MAX_PLEDGE_AMT, df = dfs)2 NA NA NA NA
ns(GIVING_MAX_PLEDGE_AMT, df = dfs)3 NA NA NA NA
ns(GIVING_MAX_PLEDGE_AMT, df = dfs)4 NA NA NA NA
ns(GIVING_CASH_TOTAL, df = dfs)1 -4.44841 3.24325 -1.372 0.170190
ns(GIVING_CASH_TOTAL, df = dfs)2 0.39867 2.32841 0.171 0.864052
ns(GIVING_CASH_TOTAL, df = dfs)3 NA NA NA NA
ns(GIVING_CASH_TOTAL, df = dfs)4 NA NA NA NA
ns(GIVING_PLEDGE_TOTAL, df = dfs)1 7.54102 11.78945 0.640 0.522406
ns(GIVING_PLEDGE_TOTAL, df = dfs)2 NA NA NA NA
ns(GIVING_PLEDGE_TOTAL, df = dfs)3 NA NA NA NA
ns(GIVING_PLEDGE_TOTAL, df = dfs)4 NA NA NA NA
ns(GIVING_CRU_TOTAL, df = dfs)1 1.23036 2.61211 0.471 0.637625
ns(GIVING_CRU_TOTAL, df = dfs)2 0.47719 1.82306 0.262 0.793514
ns(GIVING_CRU_TOTAL, df = dfs)3 NA NA NA NA
ns(GIVING_CRU_TOTAL, df = dfs)4 NA NA NA NA
ns(GIFTS_ALLOCS_SUPPORTED, df = dfs)1 -3.34011 2.47439 -1.350 0.177057
ns(GIFTS_ALLOCS_SUPPORTED, df = dfs)2 -1.61705 2.38887 -0.677 0.498463
ns(GIFTS_ALLOCS_SUPPORTED, df = dfs)3 NA NA NA NA
ns(GIFTS_ALLOCS_SUPPORTED, df = dfs)4 NA NA NA NA
ns(GIFTS_FYS_SUPPORTED, df = dfs)1 -1.90529 1.65145 -1.154 0.248620
ns(GIFTS_FYS_SUPPORTED, df = dfs)2 1.45831 1.32042 1.104 0.269407
ns(GIFTS_FYS_SUPPORTED, df = dfs)3 NA NA NA NA
ns(GIFTS_FYS_SUPPORTED, df = dfs)4 NA NA NA NA
ns(GIFTS_CASH, df = dfs)1 -2.74367 2.25109 -1.219 0.222912
ns(GIFTS_CASH, df = dfs)2 -2.32943 2.21046 -1.054 0.291964
ns(GIFTS_CASH, df = dfs)3 NA NA NA NA
ns(GIFTS_CASH, df = dfs)4 NA NA NA NA
ns(GIFTS_PLEDGES, df = dfs)1 -0.79821 1.46695 -0.544 0.586353
ns(GIFTS_PLEDGES, df = dfs)2 NA NA NA NA
ns(GIFTS_PLEDGES, df = dfs)3 NA NA NA NA
ns(GIFTS_PLEDGES, df = dfs)4 NA NA NA NA
ns(CASH_PFY1, df = dfs)1 -0.61592 0.99398 -0.620 0.535492
ns(CASH_PFY1, df = dfs)2 NA NA NA NA
ns(CASH_PFY1, df = dfs)3 NA NA NA NA
ns(CASH_PFY1, df = dfs)4 NA NA NA NA
ns(CASH_PFY2, df = dfs)1 -0.07331 0.74215 -0.099 0.921314
ns(CASH_PFY2, df = dfs)2 NA NA NA NA
ns(CASH_PFY2, df = dfs)3 NA NA NA NA
ns(CASH_PFY2, df = dfs)4 NA NA NA NA
ns(CASH_PFY3, df = dfs)1 0.05490 0.83274 0.066 0.947438
ns(CASH_PFY3, df = dfs)2 NA NA NA NA
ns(CASH_PFY3, df = dfs)3 NA NA NA NA
ns(CASH_PFY3, df = dfs)4 NA NA NA NA
ns(CASH_PFY4, df = dfs)1 -1.74854 0.53466 -3.270 0.001074 **
ns(CASH_PFY4, df = dfs)2 NA NA NA NA
ns(CASH_PFY4, df = dfs)3 NA NA NA NA
ns(CASH_PFY4, df = dfs)4 NA NA NA NA
ns(CASH_PFY5, df = dfs)1 1.32060 1.36500 0.967 0.333309
ns(CASH_PFY5, df = dfs)2 NA NA NA NA
ns(CASH_PFY5, df = dfs)3 NA NA NA NA
ns(CASH_PFY5, df = dfs)4 NA NA NA NA
CRU_GIVING_SEGMENTDonor 1.42674 0.74401 1.918 0.055159 .
CRU_GIVING_SEGMENTLapsed -0.25900 0.66392 -0.390 0.696460
CRU_GIVING_SEGMENTLoyal 2 of 3 1.71931 0.72091 2.385 0.017082 *
CRU_GIVING_SEGMENTLoyal 3+ 2.65804 0.79205 3.356 0.000791 ***
CRU_GIVING_SEGMENTLYBUNT 1.25671 0.70525 1.782 0.074760 .
CRU_GIVING_SEGMENTNon -0.36379 1.13524 -0.320 0.748623
CRU_GIVING_SEGMENTPYBUNT 0.44559 0.66525 0.670 0.502984
ns(EVALUATION_LOWER_BOUND, df = dfs)1 -0.14357 0.36318 -0.395 0.692610
ns(EVALUATION_LOWER_BOUND, df = dfs)2 NA NA NA NA
ns(EVALUATION_LOWER_BOUND, df = dfs)3 NA NA NA NA
ns(EVALUATION_LOWER_BOUND, df = dfs)4 NA NA NA NA
ns(UOR_LOWER_BOUND, df = dfs)1 0.39065 0.51079 0.765 0.444391
ns(UOR_LOWER_BOUND, df = dfs)2 NA NA NA NA
ns(UOR_LOWER_BOUND, df = dfs)3 NA NA NA NA
ns(UOR_LOWER_BOUND, df = dfs)4 NA NA NA NA
ns(MONTHS_ASSIGNED, df = dfs)1 -2.64816 0.99027 -2.674 0.007491 **
ns(MONTHS_ASSIGNED, df = dfs)2 NA NA NA NA
ns(MONTHS_ASSIGNED, df = dfs)3 NA NA NA NA
ns(MONTHS_ASSIGNED, df = dfs)4 NA NA NA NA
ns(COMMITTEE_NU_DISTINCT, df = dfs)1 12.12375 5.37110 2.257 0.023994 *
ns(COMMITTEE_NU_DISTINCT, df = dfs)2 14.63410 5.32911 2.746 0.006031 **
ns(COMMITTEE_NU_DISTINCT, df = dfs)3 NA NA NA NA
ns(COMMITTEE_NU_DISTINCT, df = dfs)4 NA NA NA NA
ns(COMMITTEE_NU_YEARS, df = dfs)1 1.55301 1.87288 0.829 0.406987
ns(COMMITTEE_NU_YEARS, df = dfs)2 0.02111 1.87781 0.011 0.991032
ns(COMMITTEE_NU_YEARS, df = dfs)3 NA NA NA NA
ns(COMMITTEE_NU_YEARS, df = dfs)4 NA NA NA NA
ns(COMMITTEE_KSM_DISTINCT, df = dfs)1 -0.83379 1.56126 -0.534 0.593308
ns(COMMITTEE_KSM_DISTINCT, df = dfs)2 -1.65900 1.10190 -1.506 0.132177
ns(COMMITTEE_KSM_DISTINCT, df = dfs)3 NA NA NA NA
ns(COMMITTEE_KSM_DISTINCT, df = dfs)4 NA NA NA NA
ns(EVENTS_PREV_3_FY, df = dfs)1 11.43632 8.43833 1.355 0.175328
ns(EVENTS_PREV_3_FY, df = dfs)2 NA NA NA NA
ns(EVENTS_PREV_3_FY, df = dfs)3 NA NA NA NA
ns(EVENTS_PREV_3_FY, df = dfs)4 NA NA NA NA
ns(EVENTS_CFY, df = dfs)1 -12.26972 5.95506 -2.060 0.039362 *
ns(EVENTS_CFY, df = dfs)2 NA NA NA NA
ns(EVENTS_CFY, df = dfs)3 NA NA NA NA
ns(EVENTS_CFY, df = dfs)4 NA NA NA NA
ns(EVENTS_PFY1, df = dfs)1 -5.03194 3.84315 -1.309 0.190423
ns(EVENTS_PFY1, df = dfs)2 NA NA NA NA
ns(EVENTS_PFY1, df = dfs)3 NA NA NA NA
ns(EVENTS_PFY1, df = dfs)4 NA NA NA NA
ns(ATHLETICS_TICKET_YEARS, df = dfs)1 -0.07030 0.71623 -0.098 0.921812
ns(ATHLETICS_TICKET_YEARS, df = dfs)2 NA NA NA NA
ns(ATHLETICS_TICKET_YEARS, df = dfs)3 NA NA NA NA
ns(ATHLETICS_TICKET_YEARS, df = dfs)4 NA NA NA NA
ns(YEARS_SINCE_ATHLETICS_TICKETS, df = dfs)1 2.22483 1.14494 1.943 0.051994 .
ns(YEARS_SINCE_ATHLETICS_TICKETS, df = dfs)2 NA NA NA NA
ns(YEARS_SINCE_ATHLETICS_TICKETS, df = dfs)3 NA NA NA NA
ns(YEARS_SINCE_ATHLETICS_TICKETS, df = dfs)4 NA NA NA NA
ns(RECORD_YR, df = dfs)1 1.70554 2.37588 0.718 0.472846
ns(RECORD_YR, df = dfs)2 1.32818 1.67565 0.793 0.427992
ns(RECORD_YR, df = dfs)3 2.56396 5.21086 0.492 0.622690
ns(RECORD_YR, df = dfs)4 1.93846 0.85378 2.270 0.023180 *
ns(YEARS_SINCE_MAX_CASH_YR, df = dfs)1 -0.51665 0.36524 -1.415 0.157197
ns(YEARS_SINCE_MAX_CASH_YR, df = dfs)2 -0.26841 1.47337 -0.182 0.855444
ns(YEARS_SINCE_MAX_CASH_YR, df = dfs)3 -7.41608 6.61100 -1.122 0.261957
ns(YEARS_SINCE_MAX_CASH_YR, df = dfs)4 -13.15816 12.91073 -1.019 0.308125
GIVING_MAX_CASH_MO2 0.58410 0.36639 1.594 0.110890
GIVING_MAX_CASH_MO3 0.72485 0.33811 2.144 0.032048 *
GIVING_MAX_CASH_MO4 0.61009 0.33484 1.822 0.068454 .
GIVING_MAX_CASH_MO5 0.22521 0.33464 0.673 0.500960
GIVING_MAX_CASH_MO6 0.43381 0.33461 1.296 0.194809
GIVING_MAX_CASH_MO7 0.50672 0.36145 1.402 0.160938
GIVING_MAX_CASH_MO8 0.66170 0.33205 1.993 0.046287 *
GIVING_MAX_CASH_MO9 0.64405 0.40846 1.577 0.114849
GIVING_MAX_CASH_MO10 0.59427 0.37560 1.582 0.113609
GIVING_MAX_CASH_MO11 0.24460 0.35634 0.686 0.492452
GIVING_MAX_CASH_MO12 0.66246 0.30957 2.140 0.032363 *
KSM_PROSPECTNo -0.24378 0.25672 -0.950 0.342319
KSM_PROSPECTPast 0.06131 0.34486 0.178 0.858886
ns(VISITORS_5FY, df = dfs)1 -1.07886 2.04243 -0.528 0.597344
ns(VISITORS_5FY, df = dfs)2 NA NA NA NA
ns(VISITORS_5FY, df = dfs)3 NA NA NA NA
ns(VISITORS_5FY, df = dfs)4 NA NA NA NA
LOYAL_5_PCT_CASH 3.66823 2.50201 1.466 0.142617
UPGRADE3_CASH-1 0.02998 0.56231 0.053 0.957478
UPGRADE3_CASH0 -0.11388 0.57355 -0.199 0.842611
UPGRADE3_CASH1 -0.65798 0.67103 -0.981 0.326819
UPGRADE3_CASH2 -0.93325 0.73096 -1.277 0.201696
VELOCITY3_LIN_CASH 0.15308 0.09574 1.599 0.109836
SPOUSE_ALUMTRUE 0.38958 0.26223 1.486 0.137375
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 5263.0 on 8455 degrees of freedom
Residual deviance: 2588.9 on 8366 degrees of freedom
AIC: 2768.9
Number of Fisher Scoring iterations: 16
termplot(glm_st_splines)
Some more thoughts.
glm_st_splines <- glm(
rv.gave ~
PROGRAM_GROUP +
HOUSEHOLD_CONTINENT +
BUS_IS_EMPLOYED +
HAS_HOME_ADDR +
HAS_HOME_PHONE +
# YEARS_SINCE_FIRST_GIFT +
GIVING_FIRST_YEAR_CASH_AMT +
# GIVING_MAX_PLEDGE_AMT +
GIVING_CASH_TOTAL +
# GIVING_PLEDGE_TOTAL +
# GIVING_CRU_TOTAL +
# sqrt(GIFTS_ALLOCS_SUPPORTED) +
sqrt(GIFTS_FYS_SUPPORTED) +
# sqrt(GIFTS_CASH) +
# sqrt(GIFTS_PLEDGES) +
# CASH_PFY1 +
# CASH_PFY2 +
# CASH_PFY3 +
CASH_PFY4 +
CASH_PFY5 +
CRU_GIVING_SEGMENT +
# EVALUATION_LOWER_BOUND +
# UOR_LOWER_BOUND +
sqrt(MONTHS_ASSIGNED) +
# sqrt(COMMITTEE_NU_DISTINCT) +
# sqrt(COMMITTEE_NU_YEARS) +
# sqrt(COMMITTEE_KSM_DISTINCT) +
# sqrt(EVENTS_PREV_3_FY) +
sqrt(EVENTS_CFY) +
# sqrt(EVENTS_PFY1) +
# sqrt(ATHLETICS_TICKET_YEARS) +
YEARS_SINCE_ATHLETICS_TICKETS +
ns(RECORD_YR, df = 5) +
YEARS_SINCE_MAX_CASH_YR +
GIVING_MAX_CASH_MO +
# KSM_PROSPECT +
# sqrt(VISITORS_5FY) +
LOYAL_5_PCT_CASH +
# UPGRADE3_CASH +
VELOCITY3_LIN_CASH +
SPOUSE_ALUM
, data = traindat %>% mutate(
YEARS_SINCE_FIRST_GIFT = 2016 - ifelse(GIVING_FIRST_YEAR > 0, GIVING_FIRST_YEAR, 2017)
, YEARS_SINCE_ATHLETICS_TICKETS = 2016 - ifelse(ATHLETICS_TICKET_LAST > 0, ATHLETICS_TICKET_LAST, 2017)
, YEARS_SINCE_MAX_CASH_YR = 2016 - ifelse(GIVING_MAX_CASH_YR > 0, GIVING_MAX_CASH_YR, 2017)
)
, family = 'binomial'
)
summary(glm_st_splines)
Call:
glm(formula = rv.gave ~ PROGRAM_GROUP + HOUSEHOLD_CONTINENT +
BUS_IS_EMPLOYED + HAS_HOME_ADDR + HAS_HOME_PHONE + GIVING_FIRST_YEAR_CASH_AMT +
GIVING_CASH_TOTAL + sqrt(GIFTS_FYS_SUPPORTED) + CASH_PFY4 +
CASH_PFY5 + CRU_GIVING_SEGMENT + sqrt(MONTHS_ASSIGNED) +
sqrt(EVENTS_CFY) + YEARS_SINCE_ATHLETICS_TICKETS + ns(RECORD_YR,
df = 5) + YEARS_SINCE_MAX_CASH_YR + GIVING_MAX_CASH_MO +
LOYAL_5_PCT_CASH + VELOCITY3_LIN_CASH + SPOUSE_ALUM, family = "binomial",
data = traindat %>% mutate(YEARS_SINCE_FIRST_GIFT = 2016 -
ifelse(GIVING_FIRST_YEAR > 0, GIVING_FIRST_YEAR, 2017),
YEARS_SINCE_ATHLETICS_TICKETS = 2016 - ifelse(ATHLETICS_TICKET_LAST >
0, ATHLETICS_TICKET_LAST, 2017), YEARS_SINCE_MAX_CASH_YR = 2016 -
ifelse(GIVING_MAX_CASH_YR > 0, GIVING_MAX_CASH_YR,
2017)))
Deviance Residuals:
Min 1Q Median 3Q Max
-2.6681 -0.2635 -0.1355 -0.0279 3.4887
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.459908 3.240969 -1.993 0.046239 *
PROGRAM_GROUPNONE -0.881538 0.317046 -2.780 0.005428 **
PROGRAM_GROUPEMP -0.466860 0.176695 -2.642 0.008237 **
PROGRAM_GROUPEXECED -2.378178 0.747189 -3.183 0.001458 **
PROGRAM_GROUPNONGRD -12.734917 283.582936 -0.045 0.964181
PROGRAM_GROUPPHD -1.893617 1.351830 -1.401 0.161280
PROGRAM_GROUPTMP -0.216800 0.146585 -1.479 0.139138
HOUSEHOLD_CONTINENT -2.818500 0.588387 -4.790 1.67e-06 ***
HOUSEHOLD_CONTINENTAfrica 0.877588 1.058118 0.829 0.406886
HOUSEHOLD_CONTINENTAsia 0.116692 0.237399 0.492 0.623042
HOUSEHOLD_CONTINENTEurope -0.663975 0.387007 -1.716 0.086223 .
HOUSEHOLD_CONTINENTOceania -1.945458 1.257454 -1.547 0.121829
HOUSEHOLD_CONTINENTSouth America -1.764793 1.035977 -1.704 0.088474 .
BUS_IS_EMPLOYEDTRUE 0.431911 0.184991 2.335 0.019556 *
HAS_HOME_ADDRTRUE -0.322682 0.117715 -2.741 0.006121 **
HAS_HOME_PHONETRUE -0.306948 0.124627 -2.463 0.013781 *
GIVING_FIRST_YEAR_CASH_AMT -0.093560 0.066105 -1.415 0.156974
GIVING_CASH_TOTAL 0.227620 0.125721 1.811 0.070216 .
sqrt(GIFTS_FYS_SUPPORTED) 0.397117 0.098505 4.031 5.54e-05 ***
CASH_PFY4 0.254317 0.064851 3.922 8.80e-05 ***
CASH_PFY5 -0.351803 0.178015 -1.976 0.048126 *
CRU_GIVING_SEGMENTDonor 1.225055 0.374310 3.273 0.001065 **
CRU_GIVING_SEGMENTLapsed -0.076892 0.331285 -0.232 0.816458
CRU_GIVING_SEGMENTLoyal 2 of 3 1.680229 0.373244 4.502 6.74e-06 ***
CRU_GIVING_SEGMENTLoyal 3+ 2.651960 0.415307 6.386 1.71e-10 ***
CRU_GIVING_SEGMENTLYBUNT 1.282945 0.382352 3.355 0.000792 ***
CRU_GIVING_SEGMENTNon -0.189812 0.360712 -0.526 0.598739
CRU_GIVING_SEGMENTPYBUNT 0.559489 0.344457 1.624 0.104320
sqrt(MONTHS_ASSIGNED) 0.114177 0.051137 2.233 0.025565 *
sqrt(EVENTS_CFY) 0.205010 0.076427 2.682 0.007309 **
YEARS_SINCE_ATHLETICS_TICKETS -0.223747 0.089858 -2.490 0.012774 *
ns(RECORD_YR, df = 5)1 2.219727 3.087909 0.719 0.472237
ns(RECORD_YR, df = 5)2 3.027579 3.287942 0.921 0.357148
ns(RECORD_YR, df = 5)3 1.556128 2.153782 0.723 0.469981
ns(RECORD_YR, df = 5)4 4.497431 6.547683 0.687 0.492162
ns(RECORD_YR, df = 5)5 1.786750 1.068678 1.672 0.094539 .
YEARS_SINCE_MAX_CASH_YR -0.028530 0.008864 -3.218 0.001289 **
GIVING_MAX_CASH_MO2 0.546028 0.355209 1.537 0.124243
GIVING_MAX_CASH_MO3 0.681945 0.328873 2.074 0.038118 *
GIVING_MAX_CASH_MO4 0.548626 0.324177 1.692 0.090576 .
GIVING_MAX_CASH_MO5 0.208072 0.323859 0.642 0.520564
GIVING_MAX_CASH_MO6 0.356457 0.322310 1.106 0.268750
GIVING_MAX_CASH_MO7 0.360217 0.351486 1.025 0.305440
GIVING_MAX_CASH_MO8 0.514667 0.321437 1.601 0.109346
GIVING_MAX_CASH_MO9 0.611605 0.396902 1.541 0.123330
GIVING_MAX_CASH_MO10 0.582833 0.362755 1.607 0.108124
GIVING_MAX_CASH_MO11 0.250833 0.347161 0.723 0.469971
GIVING_MAX_CASH_MO12 0.635602 0.299055 2.125 0.033556 *
LOYAL_5_PCT_CASH 5.479513 2.212318 2.477 0.013256 *
VELOCITY3_LIN_CASH 0.119396 0.049366 2.419 0.015580 *
SPOUSE_ALUMTRUE 0.409974 0.253411 1.618 0.105701
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 5263.0 on 8455 degrees of freedom
Residual deviance: 2638.8 on 8405 degrees of freedom
AIC: 2740.8
Number of Fisher Scoring iterations: 16
Fit a logistic regression model with the ridge penalizer using the same subset of variables chosen in the previuos step.
remove(glm_ridge)
glm_ridge_cv <- cv.glmnet(
rv.gave ~
PROGRAM_GROUP +
HOUSEHOLD_CONTINENT +
BUS_IS_EMPLOYED +
HAS_HOME_ADDR +
HAS_HOME_PHONE +
# YEARS_SINCE_FIRST_GIFT +
GIVING_FIRST_YEAR_CASH_AMT +
# GIVING_MAX_PLEDGE_AMT +
GIVING_CASH_TOTAL +
# GIVING_PLEDGE_TOTAL +
# GIVING_CRU_TOTAL +
# sqrt(GIFTS_ALLOCS_SUPPORTED) +
sqrt(GIFTS_FYS_SUPPORTED) +
# sqrt(GIFTS_CASH) +
# sqrt(GIFTS_PLEDGES) +
# CASH_PFY1 +
# CASH_PFY2 +
# CASH_PFY3 +
CASH_PFY4 +
CASH_PFY5 +
CRU_GIVING_SEGMENT +
# EVALUATION_LOWER_BOUND +
# UOR_LOWER_BOUND +
sqrt(MONTHS_ASSIGNED) +
# sqrt(COMMITTEE_NU_DISTINCT) +
# sqrt(COMMITTEE_NU_YEARS) +
# sqrt(COMMITTEE_KSM_DISTINCT) +
# sqrt(EVENTS_PREV_3_FY) +
sqrt(EVENTS_CFY) +
# sqrt(EVENTS_PFY1) +
# sqrt(ATHLETICS_TICKET_YEARS) +
YEARS_SINCE_ATHLETICS_TICKETS +
ns(RECORD_YR, df = 5) +
YEARS_SINCE_MAX_CASH_YR +
GIVING_MAX_CASH_MO +
# KSM_PROSPECT +
# sqrt(VISITORS_5FY) +
LOYAL_5_PCT_CASH +
# UPGRADE3_CASH +
VELOCITY3_LIN_CASH +
SPOUSE_ALUM
, data = traindat %>% mutate(
YEARS_SINCE_FIRST_GIFT = 2016 - ifelse(GIVING_FIRST_YEAR > 0, GIVING_FIRST_YEAR, 2017)
, YEARS_SINCE_ATHLETICS_TICKETS = 2016 - ifelse(ATHLETICS_TICKET_LAST > 0, ATHLETICS_TICKET_LAST, 2017)
, YEARS_SINCE_MAX_CASH_YR = 2016 - ifelse(GIVING_MAX_CASH_YR > 0, GIVING_MAX_CASH_YR, 2017)
)
, family = 'binomial'
, alpha = 0 # Ridge penalty
)
Compare coefficients between the penalized and unpenalized models.
full_join(
data.frame(var = coef(glm_st_splines) %>% names(), unpenalized = coef(glm_st_splines))
, data.frame(var = coef(glm_ridge_cv)[, 1] %>% names(), shrinkage = coef(glm_ridge_cv)[, 1])
, by = c('var', 'var')
) %>% gather(model, 'coefficient', 2:3) %>%
na.omit() %>%
arrange(abs(coefficient) %>% desc()) %>%
ggplot(aes(x = var %>% reorder(-abs(coefficient)), y = coefficient, color = model)) +
geom_hline(yintercept = 0, color = 'darkgray') +
geom_point(alpha = .5) +
scale_y_continuous(trans = 'neg_sqrt', breaks = c(-50, -40, -30, seq(-20, 20, by = 5), -2, -.5, .5, 2)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = .3)
, panel.grid.minor = element_line(linetype = 'dotted')) +
labs(x = 'var')
The ridge penalty leads to fairly aggressive coefficient shrinkage.
# Holdout data with new variables
holdout_new <- holdoutdat %>% mutate(
YEARS_SINCE_FIRST_GIFT = 2016 - ifelse(GIVING_FIRST_YEAR > 0, GIVING_FIRST_YEAR, 2017)
, YEARS_SINCE_ATHLETICS_TICKETS = 2016 - ifelse(ATHLETICS_TICKET_LAST > 0, ATHLETICS_TICKET_LAST, 2017)
, YEARS_SINCE_MAX_CASH_YR = 2016 - ifelse(GIVING_MAX_CASH_YR > 0, GIVING_MAX_CASH_YR, 2017)
)
# Threshold
theta1 <- sum(traindat$rv.gave) / nrow(traindat)
# Calculations
tmp.ns <- conf_matrix(glm_standard, newdata = holdoutdat, threshold = theta1)
tmp.s <- conf_matrix(glm_st_splines, newdata = holdout_new, threshold = theta1)
tmp.rs <- conf_matrix_glmnet(glm_ridge_cv, newdata = holdout_new, rv = 'rv.gave', threshold = theta1)
# Data frame
model_compare <- cbind(
glm_baseline_err
, glm_nospline = c(tmp.ns$err, tmp.ns$prec, tmp.ns$sens, tmp.ns$F1)
, glm_spline = c(tmp.s$err, tmp.s$prec, tmp.s$sens, tmp.ns$F1)
, glm_ridge = c(tmp.rs$err, tmp.rs$prec, tmp.rs$sens, tmp.rs$F1)
)
remove(tmp.ns, tmp.s, tmp.rs)
print(model_compare)
With threshold \(\theta =\) 0.092 the glm_ridge model is the winner.
# Calculateions
tmp.ns <- conf_matrix(glm_standard, newdata = holdoutdat)
tmp.s <- conf_matrix(glm_st_splines, newdata = holdout_new)
tmp.rs <- conf_matrix_glmnet(glm_ridge_cv, newdata = holdout_new, rv = 'rv.gave')
# Data frame
model_compare <- cbind(
glm_baseline_err
, glm_nospline = c(tmp.ns$err, tmp.ns$prec, tmp.ns$sens, tmp.ns$F1)
, glm_spline = c(tmp.s$err, tmp.s$prec, tmp.s$sens, tmp.ns$F1)
, glm_ridge = c(tmp.rs$err, tmp.rs$prec, tmp.rs$sens, tmp.rs$F1)
)
remove(tmp.ns, tmp.s, tmp.rs)
print(model_compare)
But with a decision threshold of \(\theta =\) 0.5 the standard glm performs somewhat better, minimizing false negatives.
Consider the calibration plots.
smooth.method <- 'loess'
glm_preds <- data.frame(
class = (holdoutdat[, 1] + 0) %>% unlist()
, ridge.baseline = predict(glm_ridge_baseline_model, newdata = holdout_new, type = 'response')
, nospline = predict(glm_standard, newdata = holdout_new, type = 'response')
, spline = predict(glm_st_splines, newdata = holdout_new, type = 'response')
, ridge = predict(glm_ridge_cv, newdata = holdout_new, type = 'response')
) %>% setNames(
c('class', 'ridge.baseline', 'nospline', 'spline', 'ridge')
) %>% gather(
'model', 'prediction', ridge.baseline:ridge
)
# Plotting
glm_preds %>%
ggplot(aes(x = prediction, y = class, group = model, color = model)) +
geom_point(color = 'black', alpha = .1) +
geom_smooth(method = smooth.method, alpha = .5) +
geom_abline(slope = 1, intercept = 0) +
labs(title = paste0('Predictions with OOS smoother (', smooth.method, ')'), color = 'model'
, x = 'predicted probability'
, y = 'observed probability')
Interestingly, out-of-box baseline ridge regression outperforms the ridge regression model with fewer explanatory variables. Between these four I’d take the nospline glm due to its interpretability.
We can also look at the ROC curves.
rocdat <- cbind(model = 'ridge.baseline', roc_matrix_gen(glm_ridge_baseline_model, data = holdout_new)) %>%
rbind(cbind(model = 'nospline', roc_matrix_gen(glm_standard, data = holdout_new))) %>%
rbind(cbind(model = 'spline', roc_matrix_gen(glm_st_splines, data = holdout_new))) %>%
rbind(cbind(model = 'ridge', roc_matrix_gen(glm_ridge_cv, data = holdout_new)))
# Plot results
rocdat %>%
ggplot(aes(x = FPR, y = TPR, color = model)) +
geom_line(size = 1) +
geom_abline(slope = 1, intercept = 0, linetype = 'dashed', col = 'black') +
scale_x_continuous(breaks = seq(0, 1, by = .1), expand = c(0, 0)) +
scale_y_continuous(breaks = seq(0, 1, by = .1), expand = c(0, 0)) +
coord_equal() +
labs(title = 'ROC plot')
Computing the AUC:
data.frame(
ridge.baseline = with(
rocdat %>% filter(model == 'ridge.baseline')
, sum(1/nrow(holdoutdat) * TPR)
)
, nospline = with(
rocdat %>% filter(model == 'nospline')
, sum(1/nrow(holdoutdat) * TPR)
)
, spline = with(
rocdat %>% filter(model == 'spline')
, sum(1/nrow(holdoutdat) * TPR)
)
, ridge = with(
rocdat %>% filter(model == 'ridge')
, sum(1/nrow(holdoutdat) * TPR)
)
)
These are pretty similar, but again the nospline glm appears to be a reasonable choice.
I found previously that linear regression actually works pretty well for predicting cumulative giving amounts, particularly when conditioning on donor status (thereby excluding all 0 entries).
\[ E \left(\text{log giving | donor, covariates} \right) = E\left(\text{log}\left(Y_i\right)\right) = X_i \boldsymbol{\beta} \]
Here, \(Y_i = \text{FY18Giving}_i + \text{FY17Giving}_i\) and the training data \(X_i\) includes the observations where \(Y_i > 0\).
I’ll again pre-screen variables with the Boruta algorithm, but this time the response variable is continuous and only donors will be included.
# Sample rows
prop = 1/5 # Proportion of data to sample
set.seed(7968177)
# Include only entities who gave
samp <- sample_n(
modeling.data %>% filter(rv.gave)
, size = nrow(modeling.data %>% filter(rv.gave)) * prop
) %>% select(
-rv.gave, -ID_NUMBER, -HOUSEHOLD_ID, -INSTITUTIONAL_SUFFIX, -DEGREES_CONCAT
)
# Run Boruta algorithm, but only on entities that gave
rf.vars.lm <- Boruta(
y = log10(samp$rv.amt)
, x = samp %>% select(-rv.amt)
, seed = 3371662
)
rf.vars.lm %>% print()
Boruta performed 99 iterations in 8.94825 mins.
63 attributes confirmed important: AF_PFY1, AF_PFY2, AF_PFY3, AF_PFY4, AF_PFY5 and 58 more;
76 attributes confirmed unimportant: ACTIVITIES_CFY, ACTIVITIES_PFY1, ACTIVITIES_PFY2,
ACTIVITIES_PFY3, ACTIVITIES_PFY4 and 71 more;
8 tentative attributes left: CRU_GIVING_SEGMENT, CRU_STATUS, EVENTS_ATTENDED, GAVE_10K,
GIFTS_MATCHES and 3 more;
Save the results.
save(rf.vars.lm, file = 'data/rf.vars.lm.Rdata')
(lmod_plot <- rf.vars.lm %>% Borutadata() %>% Borutaplotter())
Some unsurprising findings:
And some surprising ones:
(recommended.vars.lm <- TentativeRoughFix(rf.vars.lm))
Boruta performed 99 iterations in 8.94825 mins.
Tentatives roughfixed over the last 99 iterations.
69 attributes confirmed important: AF_PFY1, AF_PFY2, AF_PFY3, AF_PFY4, AF_PFY5 and 64 more;
78 attributes confirmed unimportant: ACTIVITIES_CFY, ACTIVITIES_PFY1, ACTIVITIES_PFY2,
ACTIVITIES_PFY3, ACTIVITIES_PFY4 and 73 more;
# Check variable correlations
recommended_vars_lm <- recommended.vars.lm$finalDecision[
which(recommended.vars.lm$finalDecision == 'Confirmed')] %>% names()
numeric_vars_lm <- modeling.data %>%
filter(rv.gave) %>%
select(recommended_vars_lm) %>%
select_if(is.numeric)
numeric_vars_lm %>% plot_corrs(textsize = 2)